{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9fbd32ce",
   "metadata": {},
   "source": [
    "# Medical Research Project\n",
    "\n",
    "[paperai](https://github.com/neuml/paperai) is an AI application for medical and scientific papers.\n",
    "\n",
    "This notebook walks through an end-to-end process that builds a research report on `Colon Cancer in Young Adults`, a growing concern. We'll filter PubMed abstracts, build a `txtai` embeddings index and then run a research report."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bdfe887",
   "metadata": {},
   "source": [
    "# Install dependencies\n",
    "\n",
    "Install `paperetl`, `paperai` and all dependencies. This step also downloads the NLTK data required for text processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fa73395",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install git+https://github.com/neuml/paperai\n",
    "\n",
    "# Download NLTK data\n",
    "!python -c \"import nltk; nltk.download(['punkt', 'punkt_tab', 'averaged_perceptron_tagger_eng'])\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9aaa6b9b",
   "metadata": {},
   "source": [
    "# Build article repository\n",
    "\n",
    "The first step is filtering the PubMed abstracts down to articles about colon cancer. We'll keep the articles tagged with the MeSH code [D015179](https://meshb.nlm.nih.gov/record/ui?ui=D015179) (Colorectal Neoplasms).\n",
    "\n",
    "This notebook assumes the [PubMed baseline dataset](https://pubmed.ncbi.nlm.nih.gov/download/) is available in the local `/data/sources/pubmed/data` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be6201c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m paperetl.file /data/sources/pubmed/data /data/sources/pubmed/subsets/CRC/data /data/sources/pubmed/subsets/CRC/config"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "259dfdcc",
   "metadata": {},
   "source": [
    "Once complete, an `articles.sqlite` database with over 100K article abstracts will be available in the output directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8e163708",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "109054\n"
     ]
    }
   ],
   "source": [
    "!sqlite3 /data/sources/pubmed/subsets/CRC/data/articles.sqlite \"SELECT COUNT(*) FROM articles\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "0697c29b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A phase II trial of interferon alpha-2b with folinic acid and 5-fluorouracil administered by 4-hour infusion in metastatic colorectal carcinoma.\n",
      "In vitro modulation of haematoporphyrin derivative photodynamic therapy on colorectal carcinoma multicellular spheroids by verapamil.\n",
      "Accurate method to measure the percentage hepatic replacement by tumour and its use in prognosis of patients with advanced colorectal cancer.\n"
     ]
    }
   ],
   "source": [
    "!sqlite3 /data/sources/pubmed/subsets/CRC/data/articles.sqlite \"SELECT Title FROM articles LIMIT 3\""
   ]
  },
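  {
   "cell_type": "markdown",
   "id": "a3f1c2d4",
   "metadata": {},
   "source": [
    "The same checks can also be run from Python. The following is a minimal sketch using the standard library `sqlite3` module. It assumes only what the commands above show: an `articles` table with a `Title` column. The `summarize` helper is a hypothetical name for illustration and is not part of `paperetl`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7e09d15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sqlite3\n",
    "from contextlib import closing\n",
    "\n",
    "def summarize(path, limit=3):\n",
    "    \"\"\"Return the row count and the first `limit` titles from the articles table.\"\"\"\n",
    "    # closing() releases the connection when the block exits\n",
    "    with closing(sqlite3.connect(path)) as connection:\n",
    "        count = connection.execute(\"SELECT COUNT(*) FROM articles\").fetchone()[0]\n",
    "        rows = connection.execute(\"SELECT Title FROM articles LIMIT ?\", (limit,)).fetchall()\n",
    "    return count, [title for (title,) in rows]\n",
    "\n",
    "# Path used throughout this notebook; skipped when the database is not present\n",
    "path = \"/data/sources/pubmed/subsets/CRC/data/articles.sqlite\"\n",
    "if os.path.exists(path):\n",
    "    count, titles = summarize(path)\n",
    "    print(count)\n",
    "    print(\"\\n\".join(titles))"
   ]
  },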
  {
   "cell_type": "markdown",
   "id": "0793a794",
   "metadata": {},
   "source": [
    "# Build the paperai index\n",
    "\n",
    "Now that the articles have been parsed, let's build a paperai index. We'll index the articles and vectorize each section using the [PubMedBERT Embeddings](https://huggingface.co/NeuML/pubmedbert-base-embeddings) vector model. For this example, we'll limit the index to the top 10,000 most cited articles within the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29d91a62",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m paperai.index /data/sources/pubmed/subsets/CRC/data neuml/pubmedbert-base-embeddings 0 10000"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e331672",
   "metadata": {},
   "source": [
    "Once complete, additional index files will be available in the data directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1afae1fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "articles.sqlite  config.json  embeddings  ids\n"
     ]
    }
   ],
   "source": [
    "!ls /data/sources/pubmed/subsets/CRC/data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cffdc07e",
   "metadata": {},
   "source": [
    "# Research report\n",
    "\n",
    "Now that the data is all ready, let's build a report!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff01f30f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile crc.yml\n",
    "name: ColonCancer\n",
    "options:\n",
    "    llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf\n",
    "    system: You are a medical literature document parser. You extract fields from data.\n",
    "    template: |\n",
    "        Quickly extract the following field using the provided rules and context.\n",
    "\n",
    "        Rules:\n",
    "          - Keep it simple, don't overthink it\n",
    "          - ONLY extract the data\n",
    "          - NEVER explain why the field is extracted\n",
    "          - NEVER restate the field name only give the field value\n",
    "          - Say no data if the field can't be found within the context\n",
    "\n",
    "        Field:\n",
    "        {question}\n",
    "\n",
    "        Context:\n",
    "        {context}\n",
    "\n",
    "    context: 5\n",
    "    params:\n",
    "        maxlength: 4096\n",
    "        stripthink: True\n",
    "\n",
    "Research:\n",
    "    query: colon cancer young adults\n",
    "    columns:\n",
    "        - name: Date\n",
    "        - name: Study\n",
    "        - name: Study Link\n",
    "        - name: Journal\n",
    "        - {name: Sample Size, query: number of patients, question: Sample Size}\n",
    "        - {name: Objective, query: objective, question: Study Objective}\n",
    "        - {name: Causes, query: possible causes, question: List of possible causes}\n",
    "        - {name: Detection, query: diagnosis, question: List of ways to diagnose}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ec4f8ea",
   "metadata": {},
   "source": [
    "This configuration defines the report name, the [RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/) parameters and the columns to generate. We'll use a local LLM designed for working with medical literature, but any LLM supported by txtai can be used (Transformers, llama.cpp, or APIs such as OpenAI and Claude). See the [LLM pipeline](https://neuml.github.io/txtai/pipeline/text/llm/) documentation for details."
   ]
  },
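  {
   "cell_type": "markdown",
   "id": "c58d2e6a",
   "metadata": {},
   "source": [
    "To make the `template` setting more concrete, the sketch below fills the `{question}` and `{context}` placeholders with plain `str.format`. It only illustrates the idea and is not the actual txtai implementation. The `build_prompt` helper, the way sections are joined and the sample context strings are all assumptions made for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d94b7f3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Abbreviated copy of the template defined in crc.yml\n",
    "template = \"\"\"Quickly extract the following field using the provided rules and context.\n",
    "\n",
    "Field:\n",
    "{question}\n",
    "\n",
    "Context:\n",
    "{context}\n",
    "\"\"\"\n",
    "\n",
    "def build_prompt(question, sections):\n",
    "    \"\"\"Fill the template with a column question and the retrieved text sections.\"\"\"\n",
    "    # Assumption: the top matching sections are joined into one context string\n",
    "    return template.format(question=question, context=\" \".join(sections))\n",
    "\n",
    "# Hypothetical context sections, for illustration only\n",
    "print(build_prompt(\"Sample Size\", [\"First retrieved section.\", \"Second retrieved section.\"]))"
   ]
  },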
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fff4073f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m paperai.report crc.yml 10 csv /data/sources/pubmed/subsets/CRC/data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "fc70a882",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Date</th>\n",
       "      <th>Study</th>\n",
       "      <th>Study Link</th>\n",
       "      <th>Journal</th>\n",
       "      <th>Sample Size</th>\n",
       "      <th>Objective</th>\n",
       "      <th>Causes</th>\n",
       "      <th>Detection</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>2023-04-27</td>\n",
       "      <td>Young-onset colorectal cancer.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/37105987</td>\n",
       "      <td>Nature reviews. Disease primers</td>\n",
       "      <td>no data</td>\n",
       "      <td>To better understand the underlying mechanism and define relationships between environmental factors and YO-CRC development, long-term prospective studies are needed with lifestyle data collected from childhood.</td>\n",
       "      <td>hereditary cancer syndrome, antibiotic use, low physical activity, obesity, environmental factors</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2020-04-06</td>\n",
       "      <td>Trends in the epidemiology of young-onset colorectal cancer: a worldwide systematic review.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/32252672</td>\n",
       "      <td>BMC cancer</td>\n",
       "      <td>no data</td>\n",
       "      <td>Assess trends in young-onset colorectal cancer (yCRC) incidence and pooled overall APCi.</td>\n",
       "      <td>- Increasing prevalence and incidence trends of young-onset colorectal cancer (yCRC) in adults under 50 years, with annual percentage changes (APCp +2.6, APCi up to +4.03)  \\n- Increased risk of rectal cancer in adults less than 50 years (9/14 studies) with significant annual percentage change in incidence (p &lt; 0.001)</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2019-02-28</td>\n",
       "      <td>Early-onset colorectal cancer in young individuals.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/30520562</td>\n",
       "      <td>Molecular oncology</td>\n",
       "      <td>no data</td>\n",
       "      <td>Provide considerations for future perspectives.</td>\n",
       "      <td>hereditary cancer predisposing syndromes, familial CRC</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2018-03-28</td>\n",
       "      <td>Colorectal Cancer in the Young.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/29616330</td>\n",
       "      <td>Current gastroenterology reports</td>\n",
       "      <td>no data</td>\n",
       "      <td>The Study Objective is to determine whether current screening approaches should be modified and whether causation and treatment options may differ in a molecular subset of EOCRCs.</td>\n",
       "      <td>no data</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2017-05-15</td>\n",
       "      <td>The Growing Challenge of Young Adults With Colorectal Cancer.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/28516436</td>\n",
       "      <td>Oncology (Williston Park, N.Y.)</td>\n",
       "      <td>no data</td>\n",
       "      <td>address specific issues pertaining to AYA patients with colorectal cancer, including evaluation for hereditary colorectal cancer syndromes, clinicopathologic and biologic features unique to AYA patients with colorectal cancer, treatment outcomes, and survivorship.</td>\n",
       "      <td>no data</td>\n",
       "      <td>evaluation for hereditary colorectal cancer syndromes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2017-02-28</td>\n",
       "      <td>Colorectal cancer is a leading cause of cancer incidence and mortality among adults younger than 50 years in the USA: a SEER-based analysis with comparison to other young-onset cancers.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/27864324</td>\n",
       "      <td>Journal of investigative medicine : the official publication of the American Federation for Clinical Research</td>\n",
       "      <td>no data</td>\n",
       "      <td>summarize extracted data, both overall, and stratified by sex.</td>\n",
       "      <td>breast cancer, lung cancer</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2016-04-07</td>\n",
       "      <td>Different risk factors for advanced colorectal neoplasm in young adults.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/27053853</td>\n",
       "      <td>World journal of gastroenterology</td>\n",
       "      <td>70428</td>\n",
       "      <td>To evaluate and compare odds ratios (OR) for ACRN between young-adults (YA &lt; 50 years) and older-adults (OA ≥ 50 years).</td>\n",
       "      <td>age, male sex, current smoking, family history of colorectal cancer, diabetes mellitus related factors, obesity, CEA, low-density lipoprotein-cholesterol</td>\n",
       "      <td>colonoscopy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2015-11-01</td>\n",
       "      <td>High Prevalence of Hereditary Cancer Syndromes in Adolescents and Young Adults With Colorectal Cancer.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/26195711</td>\n",
       "      <td>Journal of clinical oncology : official journal of the American Society of Clinical Oncology</td>\n",
       "      <td>no data</td>\n",
       "      <td>The Study Objective is to determine whether patients diagnosed with colorectal cancer at age 35 years or younger should receive genetic counseling regardless of their family history and phenotype.</td>\n",
       "      <td>hereditary cancer syndromes</td>\n",
       "      <td>genetic counseling</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2015-03-30</td>\n",
       "      <td>Colorectal cancer in young adults.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/25480403</td>\n",
       "      <td>Digestive diseases and sciences</td>\n",
       "      <td>no data</td>\n",
       "      <td>Further studies are needed regarding this topic.</td>\n",
       "      <td>behavioral and environmental causes, inherited CRC syndromes</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2004-03-30</td>\n",
       "      <td>Colorectal cancer in the young.</td>\n",
       "      <td>https://pubmed.ncbi.nlm.nih.gov/15006562</td>\n",
       "      <td>American journal of surgery</td>\n",
       "      <td>55</td>\n",
       "      <td>Characterize CRC in the young population and determine how CRC in this population should be further addressed regarding detection and treatment.</td>\n",
       "      <td>no data</td>\n",
       "      <td>no data</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "display(HTML(pd.read_csv(\"Research.csv\").to_html(index=False)))"
   ]
  },
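  {
   "cell_type": "markdown",
   "id": "e20a6b8f",
   "metadata": {},
   "source": [
    "To focus on the rows where a field was actually extracted, the report can be filtered. This is a minimal sketch with pandas, assuming `Research.csv` is in the working directory and that empty extractions appear as the literal text `no data`, as in the table above. The `with_data` helper is a hypothetical name used for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f61c9a07",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "def with_data(frame, column):\n",
    "    \"\"\"Return the rows where `column` holds an extracted value instead of 'no data'.\"\"\"\n",
    "    # Normalize whitespace and case before comparing with the placeholder text\n",
    "    values = frame[column].astype(str).str.strip().str.lower()\n",
    "    return frame[values != \"no data\"]\n",
    "\n",
    "# Skipped when the report has not been generated yet\n",
    "if os.path.exists(\"Research.csv\"):\n",
    "    report = pd.read_csv(\"Research.csv\")\n",
    "    print(with_data(report, \"Sample Size\")[[\"Study\", \"Sample Size\"]].to_string(index=False))"
   ]
  },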
  {
   "cell_type": "markdown",
   "id": "49350895",
   "metadata": {},
   "source": [
    "The results show the requested fields and, when available, the extracted values. This is a fast way to scan a large dataset and identify the most relevant articles for a closer read during the research process.\n",
    "\n",
    "A core philosophy of `paperai` is that it doesn't use LLMs to summarize and abstract away the raw data. Instead, it builds an index that helps researchers get to the raw data they care about."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "local",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
